Load the necessary libraries
library(car) #for regression diagnostics
library(broom) #for tidy output
library(ggfortify) #for model diagnostics
library(sjPlot) #for outputs
library(knitr) #for kable
library(effects) #for partial effects plots
library(emmeans) #for estimating marginal means
library(ggeffects) #for plotting marginal means
library(MASS) #for glm.nb
library(MuMIn) #for AICc
library(tidyverse) #for data wrangling
library(modelr) #for auxillary modelling functions
library(DHARMa) #for residual diagnostics plots
library(performance) #for residuals diagnostics
library(see) #for plotting residuals
library(patchwork) #grid of plots
library(scales) #for more scales
theme_set(theme_light())
An ecologist studying a rocky shore at Phillip Island, in southeastern Australia, was interested in how clumps of intertidal mussels are maintained (Quinn 1988). In particular, he wanted to know how densities of adult mussels affected recruitment of young individuals from the plankton. As with most marine invertebrates, recruitment is highly patchy in time, so he expected to find seasonal variation, and the interaction between season and density - whether effects of adult mussel density vary across seasons - was the aspect of most interest.
The data were collected from four seasons, and with two densities of adult mussels. The experiment consisted of clumps of adult mussels attached to the rocks. These clumps were then brought back to the laboratory, and the number of baby mussels recorded. There were 3-6 replicate clumps for each density and season combination.
Format of quinn.csv data files
| season | density | recruits | sqrtrecruits | group |
|---|---|---|---|---|
| Spring | Low | 15 | 3.87 | SpringLow |
| .. | .. | .. | .. | .. |
| Spring | High | 11 | 3.32 | SpringHigh |
| .. | .. | .. | .. | .. |
| Summer | Low | 21 | 4.58 | SummerLow |
| .. | .. | .. | .. | .. |
| Summer | High | 34 | 5.83 | SummerHigh |
| .. | .. | .. | .. | .. |
| Autumn | Low | 14 | 3.74 | AutumnLow |
| .. | .. | .. | .. | .. |
| season | Categorical listing of season in which mussel clumps were collected - independent variable |
| density | Categorical listing of the density of mussels within mussel clump - independent variable |
| recruits | The number of mussel recruits - response variable |
| sqrtrecruits | Square root transformation of recruits - needed to meet the test assumptions |
| groups | Categorical listing of season/density combinations - used for checking ANOVA assumptions |
Mussel
quinn = read_csv('../data/quinn.csv', trim_ws=TRUE) %>%
janitor::clean_names()
glimpse(quinn)
## Rows: 42
## Columns: 5
## $ season <chr> "Spring", "Spring", "Spring", "Spring", "Spring", "Sprin…
## $ density <chr> "Low", "Low", "Low", "Low", "Low", "High", "High", "High…
## $ recruits <dbl> 15, 10, 13, 13, 5, 11, 10, 15, 10, 13, 1, 21, 31, 21, 18…
## $ sqrtrecruits <dbl> 3.872983, 3.162278, 3.605551, 3.605551, 2.236068, 3.3166…
## $ group <chr> "SpringLow", "SpringLow", "SpringLow", "SpringLow", "Spr…
summary(quinn)
## season density recruits sqrtrecruits
## Length:42 Length:42 Min. : 0.00 Min. :0.000
## Class :character Class :character 1st Qu.: 9.25 1st Qu.:3.041
## Mode :character Mode :character Median :13.50 Median :3.674
## Mean :18.33 Mean :3.871
## 3rd Qu.:21.75 3rd Qu.:4.663
## Max. :69.00 Max. :8.307
## group
## Length:42
## Class :character
## Mode :character
##
##
##
Since we intend to model both season and density as categorical variables, we need to explicitly declair them as factors.
quinn <- quinn %>% janitor::clean_names() %>%
mutate(season = factor(season,
levels=c('Spring', 'Summer',
'Autumn', 'Winter')),
density = factor(density))
Model formula: \[ \begin{align} y_i &\sim{} \mathcal{Pois}(\lambda_i)\\ ln(\mu_i) &= \boldsymbol{\beta} \bf{X_i}\\[1em] \end{align} \]
where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the intercept and effects of season, density and their interaction on mussel recruitment.
ggplot(quinn, aes(y=recruits, x=season, fill=density)) +
geom_boxplot()
Conclusions:
Lets mimic the effect of using a log link, by using log scaled y-axis.
ggplot(quinn, aes(y=recruits, x=season, fill=density)) +
geom_boxplot() +
scale_y_log10()
Conclusions:
quinn.glmG <- glm(log(recruits + 1) ~ density*season, data = quinn, family = gaussian)
quinn.glm <- glm(recruits ~ density*season, data = quinn, family = poisson(link = 'log'))
autoplot(quinn.glm, which = 1:6)
Cook’s D is off, so although residuals and QQ are fine, there’s something wrong with the model
quinn.resid <- simulateResiduals(quinn.glm, plot = TRUE)
Problem with deviance = More variance than we should expect Model is overdispersed so the model is not reliable: potential because this si real data and there are more variables that explain recruitment, not just our variables. We can add an ‘unit level random effect’ to our model = add a variable that makes each variable unique and ‘soaks’ some of the variance. It will take one DF and shrinks the other estimates, making them less obvious. Maybe some of the zeros are not actually zeros because the recruits were to small to be counted (zero inflated model).
Can also test for dispersion:
testDispersion(quinn.resid) #red line is your supposed actual disperision, the histogram is what it actually should be
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 3.1726, p-value < 2.2e-16
## alternative hypothesis: two.sided
performance::check_overdispersion(quinn.glm) #3.309 = 3x more variable that there would normally be
## # Overdispersion test
##
## dispersion ratio = 3.309
## Pearson's Chi-Squared = 112.497
## p-value = < 0.001
Diagnosis = overdispersersion in the model (3x)
testZeroInflation(quinn.resid)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 7.6923, p-value = 0.04
## alternative hypothesis: two.sided
performance::check_zeroinflation(quinn.glm) #shows only 2 extra zeros than expected
## # Check for zero-inflation
##
## Observed zeros: 2
## Predicted zeros: 0
## Ratio: 0.00
Checking for overdispersion and zero inflation together actually provides a better picture. It showed that the model is overdispersed but it then showed that it’s not really zero inflated as 2 zeros are not much.
We’ll fit a negative binomial by dividing data into zeros and ones.
quinn.nb <- MASS::glm.nb(recruits ~ density*season, data = quinn)
quinn.resid <- simulateResiduals(quinn.nb, plot = TRUE)
Compare:
AICc(quinn.glm, quinn.nb)
plot_model(quinn.nb, type = 'eff', terms = c('season', 'density'))
summary(quinn.nb) #it's on a log-scale
##
## Call:
## MASS::glm.nb(formula = recruits ~ density * season, data = quinn,
## init.theta = 9.022467857, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8704 -0.8274 0.0000 0.4999 2.1866
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1875 12.283 < 2e-16 ***
## densityLow 0.1133 0.2742 0.413 0.67934
## seasonSummer 1.5721 0.2389 6.581 4.69e-11 ***
## seasonAutumn 0.6763 0.2492 2.714 0.00664 **
## seasonWinter -0.5680 0.2881 -1.971 0.04870 *
## densityLow:seasonSummer -0.8970 0.3509 -2.556 0.01059 *
## densityLow:seasonAutumn -0.1881 0.3788 -0.496 0.61955
## densityLow:seasonWinter -0.8671 0.5338 -1.624 0.10432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(9.0225) family taken to be 1)
##
## Null deviance: 183.269 on 41 degrees of freedom
## Residual deviance: 54.883 on 34 degrees of freedom
## AIC: 293.09
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 9.02
## Std. Err.: 3.69
##
## 2 x log-likelihood: -275.095
tidy(quinn.nb, conf.int = TRUE, exponentiate = TRUE)
N.B. if the model isn’t balanced, an ANOVA table will change the outputs depending on the order of the variables
Can use emmeans to do a pairwise to see the effect of each season
emmeans(quinn.nb, pairwise ~ density|season, type = 'response')
## $emmeans
## season = Spring:
## density response SE df asymp.LCL asymp.UCL
## High 10.00 1.87 Inf 6.93 14.44
## Low 11.20 2.24 Inf 7.57 16.58
##
## season = Summer:
## density response SE df asymp.LCL asymp.UCL
## High 48.17 7.13 Inf 36.03 64.39
## Low 22.00 3.55 Inf 16.03 30.19
##
## season = Autumn:
## density response SE df asymp.LCL asymp.UCL
## High 19.67 3.23 Inf 14.26 27.13
## Low 18.25 3.71 Inf 12.25 27.19
##
## season = Winter:
## density response SE df asymp.LCL asymp.UCL
## High 5.67 1.24 Inf 3.69 8.70
## Low 2.67 1.07 Inf 1.21 5.87
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## season = Spring:
## contrast ratio SE df z.ratio p.value
## High / Low 0.893 0.245 Inf -0.413 0.6793
##
## season = Summer:
## contrast ratio SE df z.ratio p.value
## High / Low 2.189 0.480 Inf 3.577 0.0003
##
## season = Autumn:
## contrast ratio SE df z.ratio p.value
## High / Low 1.078 0.282 Inf 0.286 0.7749
##
## season = Winter:
## contrast ratio SE df z.ratio p.value
## High / Low 2.125 0.973 Inf 1.646 0.0999
##
## Tests are performed on the log scale
#on an absolute scale instead of fractional:
emmeans(quinn.nb, ~density|season) %>%
regrid() %>%
pairs()
## season = Spring:
## contrast estimate SE df z.ratio p.value
## High - Low -1.20 2.92 Inf -0.411 0.6812
##
## season = Summer:
## contrast estimate SE df z.ratio p.value
## High - Low 26.17 7.97 Inf 3.284 0.0010
##
## season = Autumn:
## contrast estimate SE df z.ratio p.value
## High - Low 1.42 4.92 Inf 0.288 0.7734
##
## season = Winter:
## contrast estimate SE df z.ratio p.value
## High - Low 3.00 1.64 Inf 1.829 0.0673
newdata <- emmeans(quinn.nb, ~density|season, type = 'response') %>%
as.data.frame()
ggplot(data = newdata, mapping = aes(y = response, x = season, fill = density)) +
geom_pointrange(aes(ymin = asymp.LCL, ymax = asymp.UCL,
shape = density),
position = position_dodge(width = 0.1)) + #so that they don't overlap
theme_classic() +
theme(axis.title.x = element_blank(),
legend.position = c(0.01, 1), legend.justification = c(0, 1)) +
scale_shape_manual(values = c(21, 22))
Fitting a zero inflated model:
library(pscl)
quinn.zip <- zeroinfl(recruits ~ density*season | 1, data=quinn, dist='poisson') #the "|1" fits basically a null model
#quinn.resid <- simulateResiduals(quinn.zip, plot=TRUE) does not support zero inflated models
summary(quinn.zip)
##
## Call:
## zeroinfl(formula = recruits ~ density * season | 1, data = quinn, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.4510 -0.8645 0.1262 0.7225 2.8711
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1291 17.836 < 2e-16 ***
## densityLow 0.1133 0.1858 0.610 0.54191
## seasonSummer 1.5721 0.1419 11.081 < 2e-16 ***
## seasonAutumn 0.6763 0.1586 4.266 1.99e-05 ***
## seasonWinter -0.5680 0.2147 -2.646 0.00815 **
## densityLow:seasonSummer -0.8970 0.2134 -4.202 2.64e-05 ***
## densityLow:seasonAutumn -0.1881 0.2381 -0.790 0.42957
## densityLow:seasonWinter 0.2165 0.4534 0.478 0.63300
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0037 0.7305 -4.112 3.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 13
## Log-likelihood: -151.3 on 9 Df
plogis(-3) #approximately 5% of zeros are not real zeros (-3 comes from the zero inflation model coefficients estimate), 5% is a small possibility and it makes sense according to our previous analyses
## [1] 0.04742587
#tidy(quinn.zip, conf.int=TRUE, exponentiate = TRUE)
exp(-3.0037)
## [1] 0.0496032
quinn.zip1 <- zeroinfl(recruits ~ density*season | season, data=quinn, dist='poisson') #was the rate any different in different seasons? ('|season')
summary(quinn.zip1)
##
## Call:
## zeroinfl(formula = recruits ~ density * season | season, data = quinn,
## dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -3.5327 -1.0591 0.1537 0.9216 3.6831
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1291 17.836 < 2e-16 ***
## densityLow 0.1133 0.1858 0.610 0.54191
## seasonSummer 1.5721 0.1419 11.081 < 2e-16 ***
## seasonAutumn 0.6763 0.1586 4.266 1.99e-05 ***
## seasonWinter -0.5680 0.2147 -2.646 0.00815 **
## densityLow:seasonSummer -0.8970 0.2134 -4.202 2.64e-05 ***
## densityLow:seasonAutumn -0.1881 0.2381 -0.790 0.42958
## densityLow:seasonWinter 0.2291 0.4375 0.524 0.60045
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.157e+01 1.454e+04 -0.001 0.999
## seasonSummer -2.543e-08 2.013e+04 0.000 1.000
## seasonAutumn -2.119e-08 2.107e+04 0.000 1.000
## seasonWinter 2.031e+01 1.454e+04 0.001 0.999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 13
## Log-likelihood: -148 on 12 Df
exp(-3.0037)
## [1] 0.0496032
quinn.zinb <- zeroinfl(recruits ~ density*season | 1, data=quinn, dist='negbin')
AICc(quinn.zip, quinn.zinb)
summary(quinn.zinb)
##
## Call:
## zeroinfl(formula = recruits ~ density * season | 1, data = quinn, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.981e+00 -7.511e-01 -1.522e-05 5.289e-01 2.869e+00
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1875 12.283 < 2e-16 ***
## densityLow 0.1133 0.2742 0.413 0.67933
## seasonSummer 1.5721 0.2389 6.580 4.69e-11 ***
## seasonAutumn 0.6763 0.2492 2.714 0.00664 **
## seasonWinter -0.5680 0.2881 -1.971 0.04869 *
## densityLow:seasonSummer -0.8969 0.3509 -2.556 0.01059 *
## densityLow:seasonAutumn -0.1881 0.3788 -0.496 0.61957
## densityLow:seasonWinter -0.8671 0.5338 -1.624 0.10432
## Log(theta) 2.1997 0.4091 5.378 7.55e-08 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.29 453.38 -0.034 0.973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 9.0223
## Number of iterations in BFGS optimization: 43
## Log-likelihood: -137.5 on 10 Df
exp(-15.29) #none of them are false zeros because the negative binomial already handled the small amount of zeros
## [1] 2.288956e-07